PREDICTION FROM CONTINGENCY TABLES

USING JOINT LIKELIHOODS

Lieutenant-Commander W.S. Shields
Department of Military Leadership and Management
Royal Military College of Canada

A procedure for predicting categorical outcomes using
categorical predictor variables was described by Moonan (1972).
This paper describes a related technique which uses prior
probabilities, updated by joint likelihoods, as classification
criteria. The procedure differs from Moonan's in that the outcome
having the greatest posterior probability is selected as the
prediction regardless of misclassification cost. It also differs
in method of screening and weighting the predictor variables, and
treats the problem of small-sample bias. Applications to date are
in the analysis and use of questionnaire responses to predict
categorical outcomes, namely voluntary, academic, and military
attrition from a Service College. Classification efficiency
appears to be comparable to that of the Moonan technique.



Problem 
The most common source of categorical data in behavioral
research is the questionnaire. Whether questions are
multiple-choice, or responses are grouped after the fact, it is
usually difficult and frequently impossible to order responses
along a metric scale. Even when an array of choices has a metric
design, nonlinearity of relationships may make it preferable to
treat responses as qualitative. The usual purpose of the
questionnaire is to try to predict some criterion variable. Like
the predictor variables (questionnaire responses) the criterion
variable may be either quantitative (metric) or qualitative
(categorical).

If either the criterion variable or its predictors are
metric, discriminant analysis or a related technique may be used,
unless relationships are nonlinear. If neither is metric some
form of categorical analysis must be used. One option is to
treat each category of each variable as a separate zero-one
variable. The difficulty with this is proliferation in the
number of variables. If each question has 5 possible responses
and one asks 100 questions, 500 variables result. These have so
many possible interactions that ones resulting from errors of
measurement commonly dominate the analysis.

Moonan (1972) pioneered a strategy of analysis which treats
the responses of a candidate to a selected subset of questions
as a single unit and calculates the Bayesian conditional
probability of an outcome to be predicted, given the candidate's
particular response profile, under an assumption of independence
of the questionnaire items. A classification by outcome is then
made which minimizes total misclassification cost.

Method

The strategy reported here is related to the first part of
Moonan's procedure. It computes the joint likelihood of the
candidate's responses under an assumption that a given outcome
will occur, and multiplies this by the prior probability of the
outcome, thus obtaining a quantity proportional to its posterior
probability. The outcome having the greatest posterior
probability is then selected as the prediction.¹ Because the joint
likelihood of a given outcome, under an assumption of
independence, is the product of the individual likelihoods derived
from the predictor questions for a given set of responses, one may
use the sum of the log-likelihoods as a sufficient statistic. The
logarithm of the prior probability is then added to this sum and
the result compared with that of other outcomes.
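
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify`, the nested-dictionary layout, and all numbers are hypothetical and do not come from the paper.

```python
import math

def classify(log_priors, log_likelihoods, responses):
    """Return the outcome with the greatest log-posterior score.

    log_priors:      {outcome: ln P(outcome)}
    log_likelihoods: {question: {response: {outcome: ln P(response | outcome)}}}
    responses:       {question: response} for one candidate
    """
    scores = {}
    for outcome, log_prior in log_priors.items():
        total = log_prior                      # start from the log prior
        for question, response in responses.items():
            # independence assumption: log-likelihoods simply add
            total += log_likelihoods[question][response][outcome]
        scores[outcome] = total
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical two-outcome, two-question example (invented numbers).
log_priors = {"stay": math.log(0.8), "leave": math.log(0.2)}
log_likelihoods = {
    "q1": {"yes": {"stay": math.log(0.9), "leave": math.log(0.3)},
           "no":  {"stay": math.log(0.1), "leave": math.log(0.7)}},
    "q2": {"yes": {"stay": math.log(0.6), "leave": math.log(0.2)},
           "no":  {"stay": math.log(0.4), "leave": math.log(0.8)}},
}
prediction, scores = classify(log_priors, log_likelihoods,
                              {"q1": "no", "q2": "no"})
```

The scores are proportional to, not equal to, the log posterior probabilities; only their ordering is used.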

Because the concept of "likelihood" is used more often with
continuous than with categorical variables, it should be defined
carefully here. Suppose that an entire population could be
entered into a contingency table such as that in Figure 1.
¹ This procedure was prescribed by Jeffreys (1939). He wrote
(p. 29): "The posterior probabilities of the hypotheses are
proportional to the products of the prior probabilities and the
likelihoods." Later (p. 133) he added "... when estimates
are combined the part from the prior probability enters only once,
while that from the likelihood enters every time."
The B_i are responses to a question and the A_j are outcomes to
be predicted. The frequency in each cell is f_ij; the row totals
are b_i, the column totals a_j. The likelihood of outcome A_j given
that a candidate has made response B_i is f_ij/a_j. This definition
is in accord with that of Jeffreys, and differs from that of R.A.
Fisher only in that the likelihood is required to equal the
probability of the event which has been observed, given that the
hypothesis under consideration is true, and not merely be
proportional to it. This distinction makes it possible to perform
any operation on the elements of a column vector, when considering
likelihoods, that one would perform on a row vector when
considering probabilities.
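
As a small illustration of this definition, the following sketch divides each cell frequency by its column total. The function name `likelihood_table` and the counts are hypothetical, and the sketch assumes every outcome occurs at least once so that no column total is zero.

```python
def likelihood_table(freq):
    """Convert cell frequencies into likelihoods.

    freq[i][j] is f_ij, the number of cases with response B_i and
    outcome A_j. The result holds f_ij / a_j, where a_j is the
    column total for outcome A_j.
    """
    n_cols = len(freq[0])
    col_totals = [sum(row[j] for row in freq) for j in range(n_cols)]
    return [[row[j] / col_totals[j] for j in range(n_cols)]
            for row in freq]

# Hypothetical table: 3 responses (rows) by 2 outcomes (columns).
freq = [[30, 5],
        [50, 15],
        [20, 30]]
L = likelihood_table(freq)
column_sums = [sum(row[j] for row in L) for j in range(2)]
```

Each column of the result sums to one, which is what distinguishes this definition from a likelihood that is merely proportional to the probability.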



Independence

Before proceeding, some justification should be given for
the assumption of independence of the categorical predictor
variables. Moonan (1973) justifies it partly on the basis of
computational feasibility. Certainly much less computer storage
is required under this assumption, because only contingency
tables relating to the outcome to be predicted need be stored,
rather than a set which includes every predictor versus every
other predictor. More important than this, if joint likelihoods
were to be inferred directly from the data, the criterion sample
would have to be extremely large, unless the number of questions
were extremely small, for a sufficient cell content in the
multi-dimensional contingency table that one could trust the
results. For example, if 20 questions were to be used as
predictors, a cell content greater than unity would exist only if
two persons in the criterion sample had answered all 20 questions
in exactly the same way, a rare event indeed.

Another argument for the assumption of independence is that
the consequences of failure of this condition are rarely serious.
First of all, it is unusual for categorical variables to be
intercorrelated as highly as metric ones. Commonly, one or two
categories of a variable will correlate highly with one or two
categories of another variable ("Province of Birth in Canada"
versus "Mother Tongue", for example) but overall correlations
are usually low, unless one has accidentally (or deliberately)
asked the same question twice.² Secondly, the effect of such
correlation is merely to give some additional weight to highly
correlated questions. Questions asked twice, for example,
receive double weight. If this prejudices the prediction, it
may do so in a favourable way because if a researcher has asked
many questions centered in a particular area, or the same
question in a number of different ways, it is presumably because
he believes this area or question to be important.
² Moonan (1972) also used these arguments. He wrote "... many
qualitative characters in practical problems are likely to be
nearly independent and their dependencies poorly estimated."
Selection of Predictors

χ² is a convenient criterion of variable selection because
it is easily tested for significance. The author prefers the
likelihood measure

$$\chi^2 = -2 \ln \lambda$$

over the Pearson χ² measure because of its stability when as
many as 50% of the contingency table cells (not counting entirely
blank rows and columns) are vacant. Wilks (1928) demonstrated
that -2 ln λ is just as trustworthy as the Pearson approximation.
The measure is identical to that derived from Shannon information
theory:

$$\chi^2 = 2N\hat{T}$$

where

$$\hat{T} = \hat{H}(A) + \hat{H}(B) - \hat{H}(A,B)
          = \hat{H}(A) - \hat{H}(A|B)
          = \hat{H}(B) - \hat{H}(B|A)$$

and N is the sample size. A and B are two categorical variables,
and the uncertainties (entropies) are calculated in "nits"
using:

$$\hat{H}(A) = -\sum_{j} \frac{a_j}{N} \ln \frac{a_j}{N}, \qquad
  \hat{H}(B) = -\sum_{i} \frac{b_i}{N} \ln \frac{b_i}{N}$$

and

$$\hat{H}(A,B) = -\sum_{ij} \frac{f_{ij}}{N} \ln \frac{f_{ij}}{N}$$
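
The likelihood measure can be computed directly from the entropies of the margins and cells of a contingency table. The sketch below is illustrative only: the function names and the two example tables are hypothetical, and the degrees of freedom are counted on the assumption that no row or column is entirely blank.

```python
import math

def entropy(counts, n):
    """Entropy in nits (natural logarithms) of counts that sum to n."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def transmitted_information(freq):
    """T = H(A) + H(B) - H(A,B) for freq[i][j] = f_ij; also returns N."""
    n = sum(sum(row) for row in freq)
    row_totals = [sum(row) for row in freq]          # b_i
    col_totals = [sum(col) for col in zip(*freq)]    # a_j
    cells = [f for row in freq for f in row]         # f_ij
    t = (entropy(col_totals, n) + entropy(row_totals, n)
         - entropy(cells, n))
    return t, n

def likelihood_chi_square(freq):
    """Return (2 N T, degrees of freedom) for a contingency table."""
    t, n = transmitted_information(freq)
    dof = (len(freq) - 1) * (len(freq[0]) - 1)       # assumes no blank rows/columns
    return 2 * n * t, dof

# Hypothetical tables: one with proportional rows, one with association.
g_indep, dof_indep = likelihood_chi_square([[10, 20], [20, 40]])
g_assoc, dof_assoc = likelihood_chi_square([[40, 10], [10, 40]])
```

A table whose rows are proportional gives a statistic of essentially zero, while the second table gives a value far above the 5% critical value of 3.84 for one degree of freedom.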
Having found a subset of predictors possessing significant
relationships with the outcome to be predicted, it is suggested
that the contribution of each predictor to the total
log-likelihood be given a weight proportional to Newman and
Gerstman's (1952) coefficient of constraint:

$$D(A|B) = \frac{\hat{H}(A) - \hat{H}(A|B)}{\hat{H}(A)}
         = \frac{\hat{T}}{\hat{H}(A)}$$

Giving inferior weight to questions of low relevance can be
justified on the basis of enhancing the "signal-to-noise" ratio.
D(A|B) is the relative reduction in the uncertainty of A when B
is known, and is asymmetrical. If desired, D can be corrected
for bias using formulas developed by Miller (1954) for correction
of Ĥ and T̂:

$$H' = \hat{H} + (K - 1)/2N$$

where K is the number of occupied categories in the variable, and

$$T' = \hat{T} - (R - 1)(C - 1)/2N$$

where R is the number of occupied rows, and C the number of
occupied columns in the contingency table. (K-1) and (R-1)(C-1)
will be recognized as the number of degrees of freedom appropriate
to the test. Negative values of T' are set to zero.

When predictors are weighted proportional to D or D', it is the
practice of the writer to weight the prior distribution equally
with the strongest predictor. This amounts to giving each a
weight of unity, because a constant factor will not affect the
comparison of a number of posterior probabilities. The decision
to weight the prior in this manner is supported by the fact that
it is based on a sample of the same size as is each predictor.
The prior is simply ln(a_j/N).

Bias



Suppose k cell members occur out of a possible (column total)
n. The "maximum-likelihood" estimate of the likelihood is k/n.
This estimate, although bias-free for addition, has negative bias
for multiplication (addition of log-likelihoods). It was decided
to seek a formula for multiplicative use which would be as free
as possible of bias over a broad range of population likelihoods.

If the sampling method resembles a Bernoulli process, and p
represents the true population likelihood, two consecutive cell
frequencies r and r+1 will have the same expectation when

$$\frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{n-r}
  = \frac{n!}{(r+1)!\,(n-r-1)!}\, p^{r+1} (1-p)^{n-r-1}$$

or when

$$p = \frac{r+1}{n+1}$$

These will also be the two most commonly observed cell
frequencies, and will contain bias of opposite sign. If these
biases are to be mutually cancelling then the likelihood
estimates p̂ should be such that

$$\hat{p}_{r}\, \hat{p}_{r+1} = p^{2}$$

where p̂_r and p̂_{r+1} are the estimates obtained from observed
frequencies r and r+1. Letting p̂ have the form (k+a)/(n+b), the
above becomes

$$\left(\frac{r+a}{n+b}\right)\left(\frac{r+1+a}{n+b}\right)
  = \left(\frac{r+1}{n+1}\right)^{2}$$

Because r and n will vary somewhat independently, we may set b=1
and solve for a, obtaining:

$$a = \sqrt{r^{2} + 2r + 1.25} - r - .5$$



Unfortunately, r is unknown. However a is very nearly .5 over a
broad range of values of r and approaches .5 as r increases. A
test of the formula p̂ = (k+.5)/(n+1) reveals that its performance
for large values of k is improved if it is modified slightly to:
        p̂ = (k + .5) / [denominator illegible]          (1)


This has very little effect on its performance for small values
of k. Table 1 compares the expected value of joint likelihoods,
calculated from random samples using formula 1, with population
values. The estimated likelihoods are seriously biased only for
small values of both p and n. In these instances sample
likelihoods are bound to be poorly estimated. The bias is in the
direction of avoiding the rejection of a hypothesis purely on
the basis of a very small sample.

Formula 1 was adopted for estimating both likelihoods and
prior probabilities, and was found to perform slightly better
than the traditional k/n.
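
The behaviour of the offset a can be checked numerically. The sketch below tabulates a for several values of r and verifies that each value satisfies the b = 1 condition (r + a)(r + 1 + a) = (r + 1)². The function names are hypothetical, and `smoothed_estimate` implements only the intermediate form (k + .5)/(n + 1) quoted in the text, not formula 1 itself.

```python
import math

def offset_a(r):
    """Offset a that solves (r + a)(r + 1 + a) = (r + 1)**2 (the b = 1 case)."""
    return math.sqrt(r * r + 2 * r + 1.25) - r - 0.5

def smoothed_estimate(k, n, a=0.5):
    """Intermediate estimate (k + a)/(n + 1); an assumption, not formula 1."""
    return (k + a) / (n + 1)

# Tabulate the offset for a few cell frequencies.
offsets = {r: offset_a(r) for r in (0, 1, 5, 20, 100)}

# Residual of the defining condition for each tabulated value.
residuals = {r: (r + a) * (r + 1 + a) - (r + 1) ** 2
             for r, a in offsets.items()}

# Unlike k/n, the smoothed estimate is never zero, so its logarithm exists.
empty_cell = smoothed_estimate(0, 10)
```

The offset falls from about .62 at r = 0 toward .5 as r grows, which is consistent with replacing it by the constant .5.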

Missing Data 

The method of this paper lends itself to a convenient and 
profitable treatment of missing data. One category of each 
variable is reserved for missing responses (or outcomes, as
appropriate). The missing data category sometimes conveys
important predictive information. In a study relating responses
on a dental questionnaire to clinical dental examinations, the
response most indicative of an unsatisfactory oral environment
was refusal (or inability) to answer some of the questions.
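
One way to reserve such a category when building a contingency table is sketched below. The label `MISSING`, the function names, and the data are hypothetical and serve only to show missing answers being counted as a row of their own.

```python
from collections import Counter

MISSING = "(missing)"   # hypothetical label for the reserved category

def encode(response):
    """Map blank or absent answers to the reserved missing-data category."""
    if response is None or str(response).strip() == "":
        return MISSING
    return response

def cell_counts(responses, outcomes):
    """Count (response, outcome) cells, keeping missing answers as a row."""
    return Counter((encode(r), y) for r, y in zip(responses, outcomes))

# Invented responses to one question, with two unanswered items.
counts = cell_counts(["yes", None, "no", "", "yes"],
                     ["good", "poor", "good", "poor", "good"])
```

Because the missing category is treated like any other response, its likelihoods are estimated and used in exactly the same way.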

Confidence 

Because a posterior log-probability is calculated for each
outcome, the difference in this quantity for the two most probable
outcomes, less the difference in the logarithms of their priors,
equals the logarithm of their likelihood ratio. This likelihood
ratio provides an excellent indication of the confidence with
which a given prediction can be made.
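
A minimal sketch of this confidence calculation follows, assuming unweighted log-likelihoods. The function name and the numbers are hypothetical.

```python
import math

def log_likelihood_ratio(log_posteriors, log_priors):
    """Log likelihood ratio of the two most probable outcomes.

    Both arguments map outcome -> natural logarithm. The posteriors
    may be unnormalized, since only their difference is used.
    """
    ranked = sorted(log_posteriors, key=log_posteriors.get, reverse=True)
    first, second = ranked[0], ranked[1]
    diff_posterior = log_posteriors[first] - log_posteriors[second]
    diff_prior = log_priors[first] - log_priors[second]
    return first, second, diff_posterior - diff_prior

# Invented example: priors .8 and .2, joint likelihoods .04 and .56.
log_priors = {"stay": math.log(0.8), "leave": math.log(0.2)}
log_posteriors = {"stay": math.log(0.8 * 0.04),
                  "leave": math.log(0.2 * 0.56)}
first, second, llr = log_likelihood_ratio(log_posteriors, log_priors)
```

Here the predicted outcome is favoured by a likelihood ratio of 0.56/0.04, even though its prior probability is the smaller of the two.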

Test of the Method

The above procedure was developed for predicting categorical
outcomes of Cadets at a Military College, such as academic failure,
voluntary resignation, military failure, achievement of
distinction, etc., based on a questionnaire written on their first
day at the College. A total of 596 Cadets from four College years
are currently under study. Their graduation years range from 1974
to 1978.
Rather than report results prematurely, the method will be
demonstrated instead using R.A. Fisher's (1950, p. 32.180) Iris
data. It is chosen partly because it has already been used by
several researchers, including Moonan (1972), and is widely
available to future users for purposes of comparison.

Categories identical to those defined by Moonan were used in
grouping the data:
Moonan used all 150 Iris plants as both criterion and
prediction samples. As he pointed out, this stacks the odds heavily
in favour of the classification algorithm. His program produced 9
misclassifications of the 150 Iris plants compared with 7
misclassifications made by using the same procedure with the method
of this paper. The difference is nonsignificant; however, the
result supports an opinion that the algorithms are of comparable
quality. The method of this paper was tested also by shuffling
the data cards and using the first 75 as a criterion sample from
which to calculate prior probabilities and joint likelihoods for
the classification of the remaining 75 plants. This resulted in
69 correct classifications and 6 incorrect ones.
Misclassifications occurred only between Iris versicolor and Iris
virginica.
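
The steps of the procedure (priors, likelihoods, weighting by the coefficient of constraint, and classification of a separate prediction sample) can be combined into a rough end-to-end sketch. It rests on several assumptions that are not the author's: it uses the intermediate estimate (k + .5)/(n + 1) rather than formula 1, the uncorrected D, no significance screening, and a tiny invented data set rather than the Iris data. All names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy in nits of a collection of counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def constraint(pairs):
    """D(A|B) = T / H(A) for a list of (response, outcome) pairs."""
    h_a = entropy(Counter(a for _, a in pairs).values())
    h_b = entropy(Counter(b for b, _ in pairs).values())
    h_ab = entropy(Counter(pairs).values())
    return (h_a + h_b - h_ab) / h_a if h_a > 0 else 0.0

def fit(rows, outcomes):
    """Build the tables from a criterion sample of response tuples."""
    n = len(rows)
    classes = sorted(set(outcomes))
    class_n = Counter(outcomes)                       # column totals a_j
    questions = range(len(rows[0]))
    cells = [Counter((row[q], y) for row, y in zip(rows, outcomes))
             for q in questions]                      # f_ij per question
    d = [constraint([(row[q], y) for row, y in zip(rows, outcomes)])
         for q in questions]
    top = max(d) or 1.0
    weights = [x / top for x in d]                    # strongest predictor = 1
    log_prior = {c: math.log((class_n[c] + 0.5) / (n + 1)) for c in classes}
    return classes, class_n, cells, weights, log_prior

def predict(model, row):
    """Return the outcome with the greatest weighted log-posterior score."""
    classes, class_n, cells, weights, log_prior = model

    def score(c):
        s = log_prior[c]                              # prior carries weight 1
        for q, response in enumerate(row):
            k = cells[q][(response, c)]               # 0 for an unseen cell
            s += weights[q] * math.log((k + 0.5) / (class_n[c] + 1))
        return s

    return max(classes, key=score)

# Invented criterion sample: question 0 is informative, question 1 is not.
rows = [("a", "x"), ("a", "y"), ("a", "x"),
        ("b", "y"), ("b", "x"), ("b", "y")]
outcomes = ["P", "P", "P", "Q", "Q", "Q"]
model = fit(rows, outcomes)
predictions = [predict(model, r) for r in [("a", "y"), ("b", "x")]]
```

Dividing every D by the largest one gives the strongest predictor a weight of unity, so that the prior and the strongest predictor are weighted equally.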

Summary 

A procedure has been described which, like Moonan's, represents
a radical departure from conventional questionnaire analysis. The
formula developed for making unbiased estimates of joint
likelihood is equally applicable to the calculation of
sample-derived joint probabilities. There are several areas in
which the strategy is open to further refinement, particularly in
regard to the "independence" assumption.

The comparison of hypotheses through the calculation of joint
likelihoods is somewhat analogous to putting them through a long
"filter." A single very low likelihood, anywhere along the length
of the filter, can cause a hypothesis to become "clogged" and
fall hopelessly behind its competitors. Consider, for example,
the following flow of information and its effect on the
likelihoods of three competing hypotheses:
                                      Hypothesis:
Information Flow               Mammal     Bird      Fish

                                  Approximate Likelihood

a. It has a head.               1.0       1.0       1.0
b. It has eyes.                 1.0       1.0       1.0
c. It can fly.                  .01       .99       .001
d. It has no feathers.          .01       0         .001
e. It has no fur.               ~0                  .001



One can now classify the subject as a "flying fish" with
reasonable confidence. That the choice is "unlikely" is not nearly
as important as the fact that it is many times more likely than any
of the available alternatives. The procedure is well summarized
by a statement made by "Inspector Maigret" on a radio mystery
program by that name some two decades ago. Asked the secret of
his uncanny success, the famous detective replied: "Having
eliminated all of the impossibles, whatever remains - however
improbable - must be the truth".